BSTA 526: R Programming for Health Data Science
Meike Niederhausen, PhD & Jessica Minnier, PhD
OHSU-PSU School of Public Health
January 15, 2026
In this session, we’ll continue our introduction to R by working with a large dataset that more closely resembles that which you may encounter while analyzing data for research.
Remember to save this notebook under a new name, such as part_02_b526_YOURNAME.qmd.
By the end of this session, you should be able to:
- Load data into a data.frame with here::here()
- Inspect data.frames, especially variables.
- Use summary(), skimr::skim(), rstatix::get_summary_stats(), visdat, and gtsummary::tbl_summary() to gain an overview of your data
- Make a first plot with ggplot2

We’ll primarily be focusing on using functions from the tidyverse package (aka library).
Base R refers to the core R language and the set of functions that come with R out of the box, before loading any additional packages.
See the code chunk called setup for what packages are loaded above.

pacman::p_load()

The p_load() function from the pacman package:

- We use the pacman::p_load() function instead of using the library() function
- p_load() loads your packages AND installs them if they aren’t already installed.
- Add the update = TRUE argument to also update packages.

Goal: load the Excel file tcga_clinical_data.xlsx
But first… where are our files???
Where on your computer are you saving your files??
Before importing data into R, you need to know:
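- Where is your R project (.Rproj file) located?

R project (.Rproj file) location

- The folder that contains the .Rproj file is the root directory of the project.
- For the class materials, the root folder is BSTA_526_W26_class_materials_public,
- which contains the .Rproj file called BSTA_526_W26_class_materials_public.Rproj

getwd()

You can use getwd() (get working directory) to see your root folder location: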
[1] "/Users/niederha/Library/CloudStorage/OneDrive-OregonHealth&ScienceUniversity/teaching/BSTA 526/W26/0_webpage_W26/BSTA_526_W26"
- Run getwd() within RStudio.
- Compare it with the output of getwd() on your rendered html file.

Did you get the same working directories???
If you are looking at the html file posted on the class webpage, you will see that the root folder is BSTA_526_W26 and not the shared OneDrive folder.
dir()

dir() lists the files in a directory (aka folder)

[1] "_extensions"
[2] "_freeze"
[3] "_quarto.yml"
[4] "about.qmd"
[5] "BSTA_526_W26.Rproj"
[6] "data"
[7] "docs"
[8] "follow-up-plot.jpg"
[9] "function_week"
[10] "function_week.qmd"
[11] "images"
[12] "index.qmd"
[13] "minty_adapt2.scss"
[14] "part0"
[15] "part1"
[16] "part2"
[17] "resources"
[18] "schedule_class_dates.R"
[19] "schedule_days.xlsx"
[20] "schedule.qmd"
[21] "sky_modified_smaller_font_B526_W26.scss"
[22] "styles.css"
[23] "survey_feedback_previous_years.qmd"
[24] "syllabus.qmd"
[25] "weeks"
- Run dir() within RStudio.
- Compare it with the output of dir() on your rendered html file.

Do you see the same files???
If you are looking at the html file posted on the class webpage, the working directory is the Quarto project used to create the class webpage, and dir() lists all the files in the main folder for the webpage.
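- The file part_02_b526.qmd is in the folder part2, which is in the root folder BSTA_526_W26_class_materials_public.
- Relative file path (relative to the root folder): part2/part_02_b526.qmd
- Absolute file path (example): Username/Dropbox/School/BSTA526/BSTA_526_W26_class_materials_public/part2/part_02_b526.qmd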
Goal: load the Excel file tcga_clinical_data.xlsx
- Look in the data/ folder within the part2 folder, and you will find the file tcga_clinical_data.xlsx.

What is the relative file path of the Excel file?
part2/data/tcga_clinical_data.xlsx
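As a small illustration of relative versus absolute paths, the sketch below builds the relative path from its folder names with base R's file.path() and then resolves it against the current working directory. It uses only base R; whether the file is actually found depends on what getwd() returns in your session.

```r
# Build the relative path from its parts (folders first, file name last)
rel_path <- file.path("part2", "data", "tcga_clinical_data.xlsx")
rel_path

# A relative path is interpreted relative to the working directory,
# so the same string can point to different places in different sessions
abs_path <- file.path(getwd(), rel_path)
abs_path

# TRUE only if the working directory is the project root folder
file.exists(rel_path)
```

If file.exists() returns FALSE, the working directory is probably not the project root, which is the situation explored in the next steps.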
Goal: load the Excel file tcga_clinical_data.xlsx using its relative file path
- Use the read_excel() function from the readxl package.
- Change #| eval: false to true and Render the qmd file. Did it work?

What’s going on??? 🤯

- Note: the code chunk uses the option #| echo: fenced (only visible in the qmd file).

here::here() to the rescue!

In the various examples we’ve been noticing discrepancies in what is considered the working directory, depending on whether we are running code directly within RStudio or rendering the qmd file.
To make our workflow easier, we use the here() function from the here package (hence here::here()).

- With here(), the R project folder location is ALWAYS the root location,
- Run here() below to check what your root location is (make sure it’s what you want it to be)

[1] "/Users/niederha/Library/CloudStorage/OneDrive-OregonHealth&ScienceUniversity/teaching/BSTA 526/W26/0_webpage_W26/BSTA_526_W26"
When working with a Quarto webpage project, the working directory is always the project folder (root) no matter in which folder the qmd file is. When rendering a single qmd file, RStudio is actually rendering the whole website and not just the one file.
here::here()

- Within here::here(), each folder name is written in quotes " ", separated by commas (,) instead of /, and the last " " is the dataset name.

The code above will run regardless of whether you are running it within RStudio or rendering the qmd file. 😁

- here::here()
- setwd() (and why to use projects instead)

here::here() can only direct to files that are within your R project folder.
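A minimal sketch of this pattern is shown here. The folder and file names come from the text; the sheet, skip, and na values are assumptions for illustration and are not confirmed as the exact call used in class.

```r
library(here)
library(readxl)

# here() always starts from the R project root, wherever the qmd file lives
xlsx_path <- here("part2", "data", "tcga_clinical_data.xlsx")
xlsx_path

# Assumed arguments (not confirmed by the text): first sheet, one extra
# line above the column headers, and missing values coded as "NA"
if (file.exists(xlsx_path)) {
  brca_clinical <- read_excel(xlsx_path, sheet = 1, skip = 1, na = "NA")
}
```

The file.exists() check is only there so the sketch does not stop with an error when it is run outside the class project folder.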
- After loading the data, you should see brca_clinical appear in the Environment window panel in RStudio.
- If you click on the spreadsheet icon next to brca_clinical, a new tab will appear next to your qmd script containing the dataset.

Clicking on this spreadsheet icon in the Environment window is a shortcut for running View(brca_clinical); you’ll see this code appear in the Console after clicking.
These data are clinical cancer data from the National Cancer Institute’s Genomic Data Commons, specifically from The Cancer Genome Atlas, or TCGA.
Each row represents a patient
Each column represents patient information, such as demographics (race, age at diagnosis, etc) and disease (e.g., cancer type).
Hints: For the questions below,
- Look at the help file for read_excel() (from the readxl package), which is installed as part of the tidyverse.
- Which read_excel() options were used when loading the data and what do they do?
- Do we have to refer to a sheet (tab) within an excel file as a number, or can we refer to it as the sheet name instead? (Try doing this)
- What does the range argument do?
- The readxl package is installed as part of the tidyverse. The setup code chunk loads the readxl package in addition to the tidyverse package though. Was that necessary, or could we have just loaded the tidyverse package?

Inspect, and import the following sheets (tabs) from the tcga_clinical_data.xlsx excel file.
Confirm that you have loaded them correctly by clicking on the objects in the Environment pane.
- Import the CESC sheet. Save it as cesc_clinical.
- Import the LUSC sheet. Save it as lusc_clinical.

# change eval: false to true in the code chunk options
# Load cesc_clinical
# What should be the sheet argument?
# Do you need to skip a line?
cesc_clinical <- read_excel(here::here("part2", "data", "tcga_clinical_data.xlsx"),
  sheet = ____,
  skip = ___,
  na = "NA"
)
# Load lusc_clinical:
lusc_clinical <-

Two options:
If using the interactive pop-up window to load your data, make sure to copy the code to your qmd file.
Otherwise the qmd file doesn’t load the data and your code doesn’t work!
Importing data can be tricky and frustrating.
In another section, we will cover loading data in the comma-separated values (csv) format, which you can export from many different software programs.
I highly recommend reading the introduction/vignette for the readxl package and looking at the cheatsheet.
In our example:
This format is called Tidy Data, and it lets us do all sorts of things in R successfully.
Transpose is your Friend in Excel. If your data isn’t in the tidy format, no worries! You can copy it to a new sheet and use the transpose option when you’re pasting it, and then load that in.
Every column needs a name. Every one of your columns should be named at the top, and should begin with a letter. Numbers and special characters can cause errors in your data analysis pipeline.
Color information is hard to get into R. Avoid using color coding of cells if that is extra information attached to a cell. Instead, make the information the color is representing its own column.
Extra Lines are OK! Extra lines (rows) above the column header row are ok, as you’ve seen. It’s sometimes better to have a “notes” sheet where you put extra information or, better yet, a data dictionary. (Extra lines at the end of data are more difficult to deal with, and often unexpected!)
Why are these data untidy? How would you make them tidy?
You will learn how to tidy this type of data!
untidy_data2 <- tibble(
  name = c("Ana", "Bob", "Cara"),
  wt_07_01_2018 = c(100, 150, 140),
  wt_08_01_2018 = c(104, 155, 138),
  wt_09_01_2018 = c(NA, 160, 142)
)
untidy_data2

# A tibble: 3 × 4
name wt_07_01_2018 wt_08_01_2018 wt_09_01_2018
<chr> <dbl> <dbl> <dbl>
1 Ana 100 104 NA
2 Bob 150 155 160
3 Cara 140 138 142
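As a preview of what tidying could look like, here is one possible sketch using pivot_longer() from the tidyr package. The new column names date and weight are hypothetical choices for illustration, not names used elsewhere in these notes.

```r
library(tibble)
library(tidyr)

untidy_data2 <- tibble(
  name = c("Ana", "Bob", "Cara"),
  wt_07_01_2018 = c(100, 150, 140),
  wt_08_01_2018 = c(104, 155, 138),
  wt_09_01_2018 = c(NA, 160, 142)
)

# One row per person per date: the dates move out of the column names
# and into a column of their own
tidy_data2 <- pivot_longer(
  untidy_data2,
  cols = starts_with("wt_"),
  names_to = "date",       # hypothetical column name
  names_prefix = "wt_",    # drop the "wt_" prefix from the old column names
  values_to = "weight"     # hypothetical column name
)
tidy_data2
```

In the long version each variable (name, date, weight) has its own column and each observation has its own row, which is what makes the data tidy.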
Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.
data.frames

Now that we have data imported and available, we can start to inspect the data more closely.

- These data have been stored as a data.frame, which is a data structure (that is, a way of organizing data) that is analogous to tabular or spreadsheet style data.
- A data.frame is:
  - a table made of vectors (columns), where each vector holds a single data type (e.g., character, numeric).
- A data.frame, however, can include vectors (columns) of different data types. This makes them an extremely versatile way of storing information.
- In the tidyverse (for example, when loading data with readxl::read_excel), we use a tibble,
  - which is a data.frame with some perks, that you don’t need to worry about.

Most useful and obvious differences:

- Tibbles never convert strings to factors (no need for stringsAsFactors = FALSE!) - although newer versions of R (4.0.0 and later) no longer do this by default

data.frame

# A tibble: 6 × 16
submitter_id primary_diagnosis tumor_stage disease age_at_diagnosis
<chr> <chr> <chr> <chr> <dbl>
1 TCGA-3C-AAAU C50.9 stage x BRCA 20211
2 TCGA-3C-AALI C50.9 stage iib BRCA 18538
3 TCGA-3C-AALJ C50.9 stage iib BRCA 22848
4 TCGA-3C-AALK C50.9 stage ia BRCA 19074
5 TCGA-4H-AAAK C50.9 stage iiia BRCA 18371
6 TCGA-5L-AAT0 C50.9 stage iia BRCA 15393
# ℹ 11 more variables: vital_status <chr>, morphology <chr>,
# days_to_death <dbl>, days_to_birth <dbl>,
# site_of_resection_or_biopsy <chr>, days_to_last_follow_up <dbl>,
# gender <chr>, year_of_birth <dbl>, race <chr>, ethnicity <chr>,
# year_of_death <dbl>
# A tibble: 6 × 16
submitter_id primary_diagnosis tumor_stage disease age_at_diagnosis
<chr> <chr> <chr> <chr> <dbl>
1 TCGA-WT-AB41 C50.9 stage iib BRCA NA
2 TCGA-WT-AB44 C50.9 stage ia BRCA NA
3 TCGA-XX-A899 C50.9 stage iiia BRCA 17022
4 TCGA-XX-A89A C50.9 stage iib BRCA 25000
5 TCGA-Z7-A8R5 C50.9 stage iiia BRCA 22280
6 TCGA-Z7-A8R6 C50.8 stage i BRCA 16955
# ℹ 11 more variables: vital_status <chr>, morphology <chr>,
# days_to_death <dbl>, days_to_birth <dbl>,
# site_of_resection_or_biopsy <chr>, days_to_last_follow_up <dbl>,
# gender <chr>, year_of_birth <dbl>, race <chr>, ethnicity <chr>,
# year_of_death <dbl>
What happens if you just type brca_clinical in the console window?
We often need to reference the names of variables (also known as columns) in our data.frame, so it’s useful to print only those to the screen:
[1] "submitter_id" "primary_diagnosis"
[3] "tumor_stage" "disease"
[5] "age_at_diagnosis" "vital_status"
[7] "morphology" "days_to_death"
[9] "days_to_birth" "site_of_resection_or_biopsy"
[11] "days_to_last_follow_up" "gender"
[13] "year_of_birth" "race"
[15] "ethnicity" "year_of_death"
[1] "submitter_id" "primary_diagnosis"
[3] "tumor_stage" "disease"
[5] "age_at_diagnosis" "vital_status"
[7] "morphology" "days_to_death"
[9] "days_to_birth" "site_of_resection_or_biopsy"
[11] "days_to_last_follow_up" "gender"
[13] "year_of_birth" "race"
[15] "ethnicity" "year_of_death"
It’s also possible to view row names using rownames(brca_clinical), but our data only possess numbers for row names so it’s not very informative.
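The inspection functions from this section can be tried on a small stand-in for brca_clinical. The sketch below rebuilds three columns of the first six rows shown earlier as a plain data.frame; the name brca_small is hypothetical.

```r
# Small stand-in built from the first six rows printed above
brca_small <- data.frame(
  submitter_id = c("TCGA-3C-AAAU", "TCGA-3C-AALI", "TCGA-3C-AALJ",
                   "TCGA-3C-AALK", "TCGA-4H-AAAK", "TCGA-5L-AAT0"),
  tumor_stage = c("stage x", "stage iib", "stage iib",
                  "stage ia", "stage iiia", "stage iia"),
  age_at_diagnosis = c(20211, 18538, 22848, 19074, 18371, 15393)
)

head(brca_small, n = 3)   # first three rows
tail(brca_small, n = 3)   # last three rows
dim(brca_small)           # number of rows, then number of columns

colnames(brca_small)      # variable (column) names
names(brca_small)         # same result for a data.frame
rownames(brca_small)      # just "1" to "6" here, so not very informative
```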
glimpse() what’s in the data frame

We can use glimpse() to provide a general overview of the object:
Rows: 1,095
Columns: 16
$ submitter_id <chr> "TCGA-3C-AAAU", "TCGA-3C-AALI", "TCGA-3C-A…
$ primary_diagnosis <chr> "C50.9", "C50.9", "C50.9", "C50.9", "C50.9…
$ tumor_stage <chr> "stage x", "stage iib", "stage iib", "stag…
$ disease <chr> "BRCA", "BRCA", "BRCA", "BRCA", "BRCA", "B…
$ age_at_diagnosis <dbl> 20211, 18538, 22848, 19074, 18371, 15393, …
$ vital_status <chr> "alive", "alive", "alive", "alive", "alive…
$ morphology <chr> "8520/3", "8500/3", "8500/3", "8500/3", "8…
$ days_to_death <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ days_to_birth <dbl> -20211, -18538, -22848, -19074, -18371, -1…
$ site_of_resection_or_biopsy <chr> "C50.9", "C50.9", "C50.9", "C50.9", "C50.9…
$ days_to_last_follow_up <dbl> 4047, 4005, 1474, 1448, 348, 1477, 1471, 3…
$ gender <chr> "female", "female", "female", "female", "f…
$ year_of_birth <dbl> 1949, 1953, 1949, 1959, 1963, 1968, 1947, …
$ race <chr> "white", "black or african american", "bla…
$ ethnicity <chr> "not hispanic or latino", "not hispanic or…
$ year_of_death <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
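The output provided by glimpse() includes:
- the number of rows and columns,
- one line per column, which starts with $, and includes the column name, the data type, and the first few values.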
We can also find out some under the hood information about the data frame:
Here we see that the tbl_df and tbl classes are listed, as is data.frame. It is all of these things!
$

The base R way to get at a column in a data.frame is to use the $ operator.
[1] "stage x" "stage iib" "stage iib" "stage ia" "stage iiia"
[6] "stage iia"
We won’t be using this very often (in fact, part of the reason to use the tidyverse is that we can avoid using this), but I just want you to be aware of it because it’s used in a section below, and it is useful knowledge.
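Both ideas, the class of a tibble and the $ operator, can be checked on a small stand-in tibble. This is a sketch under the assumption that a hand-built tibble behaves like the one returned by read_excel(); the name brca_small is hypothetical.

```r
library(tibble)

# Small stand-in built from the first six rows printed above
brca_small <- tibble(
  tumor_stage = c("stage x", "stage iib", "stage iib",
                  "stage ia", "stage iiia", "stage iia"),
  age_at_diagnosis = c(20211, 18538, 22848, 19074, 18371, 15393)
)

# A tibble carries the tbl_df and tbl classes on top of data.frame
class(brca_small)

# $ pulls out one column as a vector
brca_small$tumor_stage

# The extracted vector can be passed to other functions
mean(brca_small$age_at_diagnosis)
```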
summary()

Use this base R function to examine basic summary statistics for each column:
submitter_id primary_diagnosis tumor_stage disease
Length:1095 Length:1095 Length:1095 Length:1095
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
age_at_diagnosis vital_status morphology days_to_death
Min. : 9706 Length:1095 Length:1095 Min. : 116.0
1st Qu.:18032 Class :character Class :character 1st Qu.: 689.2
Median :21562 Mode :character Mode :character Median :1223.0
Mean :21587 Mean :1643.0
3rd Qu.:24862 3rd Qu.:2370.0
Max. :32872 Max. :7455.0
NA's :16 NA's :945
days_to_birth site_of_resection_or_biopsy days_to_last_follow_up
Min. :-32872 Length:1095 Min. : -31.0
1st Qu.:-24862 Class :character 1st Qu.: 434.2
Median :-21562 Mode :character Median : 760.5
Mean :-21587 Mean :1187.2
3rd Qu.:-18032 3rd Qu.:1583.2
Max. : -9706 Max. :8605.0
NA's :16 NA's :105
gender year_of_birth race ethnicity
Length:1095 Min. :1902 Length:1095 Length:1095
Class :character 1st Qu.:1940 Class :character Class :character
Mode :character Median :1950 Mode :character Mode :character
Mean :1949
3rd Qu.:1960
Max. :1984
NA's :1
year_of_death
Min. :1992
1st Qu.:2001
Median :2005
Mean :2004
3rd Qu.:2008
Max. :2013
NA's :991
- For numeric variables (such as year_of_death), this output includes common statistics like median and mean, as well as the number of rows (patients) with missing data (as NA).
- For character variables, we don’t learn very much from the summary() output.
- For factors (grouped categorical variables) you’re given a count of the number of times the top six most frequent factors (categories) occur in the data frame.
  - We don’t have any factor variables just yet.
  - We will learn more about factors in the next classes.

{skimr}

The skimr package is really useful for getting an overview of your dataset.

- The main function in the skimr package is called skim().
- What does skim() give us?

# skimr is already loaded, this is just to remind you one way of loading packages
# library(skimr)
skim(brca_clinical)

| Name | brca_clinical |
| Number of rows | 1095 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| character | 10 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| submitter_id | 0 | 1 | 12 | 12 | 0 | 1095 | 0 |
| primary_diagnosis | 0 | 1 | 5 | 7 | 0 | 7 | 0 |
| tumor_stage | 0 | 1 | 7 | 12 | 0 | 13 | 0 |
| disease | 0 | 1 | 4 | 4 | 0 | 1 | 0 |
| vital_status | 0 | 1 | 4 | 5 | 0 | 2 | 0 |
| morphology | 0 | 1 | 6 | 6 | 0 | 22 | 0 |
| site_of_resection_or_biopsy | 0 | 1 | 5 | 5 | 0 | 6 | 0 |
| gender | 0 | 1 | 4 | 6 | 0 | 2 | 0 |
| race | 0 | 1 | 5 | 32 | 0 | 5 | 0 |
| ethnicity | 0 | 1 | 12 | 22 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age_at_diagnosis | 16 | 0.99 | 21587.14 | 4815.24 | 9706 | 18031.50 | 21562.0 | 24862.50 | 32872 | ▂▆▇▅▂ |
| days_to_death | 945 | 0.14 | 1642.97 | 1320.00 | 116 | 689.25 | 1223.0 | 2370.00 | 7455 | ▇▃▂▁▁ |
| days_to_birth | 16 | 0.99 | -21587.14 | 4815.24 | -32872 | -24862.50 | -21562.0 | -18031.50 | -9706 | ▂▅▇▆▂ |
| days_to_last_follow_up | 105 | 0.90 | 1187.21 | 1178.82 | -31 | 434.25 | 760.5 | 1583.25 | 8605 | ▇▂▁▁▁ |
| year_of_birth | 1 | 1.00 | 1949.41 | 13.62 | 1902 | 1940.00 | 1950.0 | 1960.00 | 1984 | ▁▃▇▇▂ |
| year_of_death | 991 | 0.09 | 2004.49 | 4.44 | 1992 | 2001.00 | 2005.0 | 2008.00 | 2013 | ▁▃▆▇▅ |
rstatix package: get_summary_stats()

- The rstatix package has a useful function called get_summary_stats() for quick summary statistics of numeric variables
- The type options are
  - type = c("full", "common", "robust", "five_number", "mean_sd", "mean_se", "mean_ci", "median_iqr", "median_mad", "quantile", "mean", "median", "min", "max")

# library(rstatix)
brca_clinical %>%
  get_summary_stats(
    age_at_diagnosis, days_to_death,
    type = "common"
  )

# A tibble: 2 × 10
variable n min max median iqr mean sd se ci
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age_at_diagnosis 1079 9706 32872 21562 6831 21587. 4815. 147. 288.
2 days_to_death 150 116 7455 1223 1681. 1643. 1320. 108. 213.
Note the output is a tibble!
You can also stratify the statistics using group_by()
# A tibble: 11 × 11
tumor_stage variable n min max median iqr mean sd se ci
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 not reported days_to_… 4 2965 7455 3194 1224. 4202 2172. 1086. 3456.
2 stage i days_to_… 13 563 3959 2009 1477 2201. 1149. 319. 694.
3 stage ia days_to_… 3 295 3926 1688 1816. 1970. 1832. 1058. 4550.
4 stage iia days_to_… 34 158 6456 1210 1488. 1610. 1249. 214. 436.
5 stage iib days_to_… 30 160 6593 1682. 1980 2114. 1461. 267. 545.
6 stage iii days_to_… 2 616 1649 1132. 516. 1132. 730. 516. 6563.
7 stage iiia days_to_… 25 172 3461 1004 1178 1192. 947. 189. 391.
8 stage iiib days_to_… 8 377 1642 743 438 814. 406. 143. 339.
9 stage iiic days_to_… 9 197 2636 548 613 877. 842. 281. 647.
10 stage iv days_to_… 15 116 3941 825 904. 1194. 1161. 300. 643.
11 stage x days_to_… 7 558 2965 1781 1328. 1792. 913. 345. 844.
Add on gt() from the gt package for a nicer looking table:
brca_clinical %>%
  group_by(tumor_stage) %>%
  get_summary_stats(
    days_to_death,
    type = "common"
  ) %>%
  gt()

| tumor_stage | variable | n | min | max | median | iqr | mean | sd | se | ci |
|---|---|---|---|---|---|---|---|---|---|---|
| not reported | days_to_death | 4 | 2965 | 7455 | 3194.0 | 1224.5 | 4202.000 | 2172.062 | 1086.031 | 3456.235 |
| stage i | days_to_death | 13 | 563 | 3959 | 2009.0 | 1477.0 | 2200.923 | 1149.244 | 318.743 | 694.481 |
| stage ia | days_to_death | 3 | 295 | 3926 | 1688.0 | 1815.5 | 1969.667 | 1831.814 | 1057.598 | 4550.478 |
| stage iia | days_to_death | 34 | 158 | 6456 | 1210.0 | 1487.5 | 1610.353 | 1248.658 | 214.143 | 435.677 |
| stage iib | days_to_death | 30 | 160 | 6593 | 1682.5 | 1980.0 | 2114.500 | 1460.723 | 266.690 | 545.443 |
| stage iii | days_to_death | 2 | 616 | 1649 | 1132.5 | 516.5 | 1132.500 | 730.441 | 516.500 | 6562.755 |
| stage iiia | days_to_death | 25 | 172 | 3461 | 1004.0 | 1178.0 | 1191.960 | 947.345 | 189.469 | 391.045 |
| stage iiib | days_to_death | 8 | 377 | 1642 | 743.0 | 438.0 | 814.375 | 405.556 | 143.386 | 339.054 |
| stage iiic | days_to_death | 9 | 197 | 2636 | 548.0 | 613.0 | 877.333 | 841.552 | 280.517 | 646.874 |
| stage iv | days_to_death | 15 | 116 | 3941 | 825.0 | 903.5 | 1194.200 | 1160.922 | 299.749 | 642.897 |
| stage x | days_to_death | 7 | 558 | 2965 | 1781.0 | 1327.5 | 1791.571 | 912.567 | 344.918 | 843.984 |
- Get an overview of cesc_clinical. Try running this both in a code chunk and the console window.
- Why are days_to_last_known_disease_status and others of type logical?
- What is a typical value of height, and does the distribution look symmetric or skewed?

visdat package

- visdat.

naniar [more in part 9-10]

I recommended Exploring missing values in naniar in the optional readings.
I highly recommend the naniar package for creating missingness visualizations, and the link above is a great tutorial with an introduction to some really useful functions for this. It imports the vis_miss function from visdat and adds functionality for missingness.
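For a quick look at the two visdat functions, here is a minimal sketch on a small stand-in built from the last six rows of brca_clinical printed earlier. The name brca_tail is hypothetical.

```r
library(visdat)

# Small stand-in built from the last six rows printed earlier;
# two patients have a missing age_at_diagnosis
brca_tail <- data.frame(
  submitter_id = c("TCGA-WT-AB41", "TCGA-WT-AB44", "TCGA-XX-A899",
                   "TCGA-XX-A89A", "TCGA-Z7-A8R5", "TCGA-Z7-A8R6"),
  tumor_stage = c("stage iib", "stage ia", "stage iiia",
                  "stage iib", "stage iiia", "stage i"),
  age_at_diagnosis = c(NA, NA, 17022, 25000, 22280, 16955)
)

# vis_dat() colors each cell by data type and marks missing values
type_plot <- vis_dat(brca_tail)
type_plot

# vis_miss() focuses on missingness and reports the percent missing
miss_plot <- vis_miss(brca_tail)
miss_plot
```

Both functions return ggplot2 objects, so they can be customized and saved like any other plot.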
gtsummary::tbl_summary() for tables [more in part 9-10]

We will learn how to use this more effectively later, but here’s a quick way to see useful information, though possibly too much information (what would you output differently if you knew how?):
brca_clinical %>%
  select(-submitter_id) %>% # do not include this variable in the table - why???
  gtsummary::tbl_summary() # note, use :: here just to highlight that this is in the gtsummary package; not necessary!

| Characteristic | N = 1,095¹ |
|---|---|
| primary_diagnosis | |
| C50.2 | 2 (0.2%) |
| C50.3 | 3 (0.3%) |
| C50.4 | 2 (0.2%) |
| C50.5 | 1 (<0.1%) |
| C50.8 | 2 (0.2%) |
| C50.9 | 1,084 (99%) |
| C50.919 | 1 (<0.1%) |
| tumor_stage | |
| not reported | 11 (1.0%) |
| stage i | 90 (8.2%) |
| stage ia | 86 (7.9%) |
| stage ib | 7 (0.6%) |
| stage ii | 6 (0.5%) |
| stage iia | 357 (33%) |
| stage iib | 256 (23%) |
| stage iii | 2 (0.2%) |
| stage iiia | 155 (14%) |
| stage iiib | 27 (2.5%) |
| stage iiic | 65 (5.9%) |
| stage iv | 20 (1.8%) |
| stage x | 13 (1.2%) |
| disease | |
| BRCA | 1,095 (100%) |
| age_at_diagnosis | 21,562 (18,025, 24,875) |
| Unknown | 16 |
| vital_status | |
| alive | 944 (86%) |
| dead | 151 (14%) |
| morphology | |
| 8010/3 | 1 (<0.1%) |
| 8013/3 | 1 (<0.1%) |
| 8022/3 | 3 (0.3%) |
| 8050/3 | 2 (0.2%) |
| 8090/3 | 1 (<0.1%) |
| 8200/3 | 1 (<0.1%) |
| 8201/3 | 1 (<0.1%) |
| 8211/3 | 1 (<0.1%) |
| 8401/3 | 1 (<0.1%) |
| 8480/3 | 16 (1.5%) |
| 8500/3 | 778 (71%) |
| 8502/3 | 1 (<0.1%) |
| 8503/3 | 6 (0.5%) |
| 8507/3 | 4 (0.4%) |
| 8510/3 | 6 (0.5%) |
| 8520/3 | 199 (18%) |
| 8522/3 | 28 (2.6%) |
| 8523/3 | 19 (1.7%) |
| 8524/3 | 7 (0.6%) |
| 8541/3 | 3 (0.3%) |
| 8575/3 | 14 (1.3%) |
| 9020/3 | 2 (0.2%) |
| days_to_death | 1,223 (678, 2,373) |
| Unknown | 945 |
| days_to_birth | -21,562 (-24,875, -18,025) |
| Unknown | 16 |
| site_of_resection_or_biopsy | |
| C50.2 | 2 (0.2%) |
| C50.3 | 3 (0.3%) |
| C50.4 | 2 (0.2%) |
| C50.5 | 1 (<0.1%) |
| C50.8 | 2 (0.2%) |
| C50.9 | 1,085 (99%) |
| days_to_last_follow_up | 761 (434, 1,587) |
| Unknown | 105 |
| gender | |
| female | 1,083 (99%) |
| male | 12 (1.1%) |
| year_of_birth | 1,950 (1,940, 1,960) |
| Unknown | 1 |
| race | |
| american indian or alaska native | 1 (<0.1%) |
| asian | 61 (5.6%) |
| black or african american | 183 (17%) |
| not reported | 93 (8.5%) |
| white | 757 (69%) |
| ethnicity | |
| hispanic or latino | 39 (3.6%) |
| not hispanic or latino | 884 (81%) |
| not reported | 172 (16%) |
| year_of_death | 2,005.0 (2,001.0, 2,008.0) |
| Unknown | 991 |
| ¹ n (%); Median (Q1, Q3) | |
ggplot2 for data viz

This is just an introduction to ggplot2. We will work towards more advanced plots and customization in future classes.
We’re going to work towards the following graph today:
ggplot2: A Grammar of Graphics

- ggplot2 is an extremely powerful software package for visualization.
- gg is short for Grammar of Graphics, which means that visualizations are expressed in a very specific way.

Here’s a visual summary of the different parts we’re talking about today.

ggplot2 code

A ggplot2 graphic consists of a:

- mapping of variables in data to
- aes()thetic attributes of
- geom_etric objects.

In code, this is translated as:
# start the plot with ggplot()
ggplot(data = brca_clinical) +
  # make the mapping
  # map the x-axis to age_at_diagnosis
  # map the y-axis to days_to_birth
  aes(
    x = age_at_diagnosis,
    y = days_to_birth
  ) +
  # add the geometry
  geom_point()

- Each part of the plot is added with a + (plus sign).
- You can read the + as and then.
- The aes() function maps variables to visual properties of the graph
- In the mapping (aes()): we mapped variables in brca_clinical to the x, and y aesthetics.

Huh. age_at_diagnosis is in days, not years. We can fix that by dividing it by 365:
We can also map a character variable to our graph to color.
Try mapping gender to color:
We can add more details to our graph. We can add a title using the labs() function:
We can change the x-axis titles and the y-axis titles also using the labs() function:
Now we’ve re-created the above plot!
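The steps in this section can be put together in one sketch. It uses a small stand-in built from the first five rows shown in the glimpse() output; the name brca_small and the title and axis label text are assumptions for illustration and may differ from the plot shown in class.

```r
library(ggplot2)

# Small stand-in built from the first five rows of the glimpse() output
brca_small <- data.frame(
  age_at_diagnosis = c(20211, 18538, 22848, 19074, 18371),
  days_to_birth = c(-20211, -18538, -22848, -19074, -18371),
  gender = c("female", "female", "female", "female", "female")
)

p <- ggplot(data = brca_small) +
  aes(
    x = age_at_diagnosis / 365,  # convert days to (approximate) years
    y = days_to_birth,
    color = gender               # map a character variable to color
  ) +
  geom_point() +
  labs(
    title = "Age at diagnosis vs. days to birth",  # hypothetical title
    x = "Age at diagnosis (years)",                # hypothetical axis label
    y = "Days to birth"                            # hypothetical axis label
  )
p
```

Storing the plot in an object such as p is optional; it makes it easy to print the plot again or pass it to other functions later.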
- We can save our plot with ggsave().
- ggsave() saves the last created plot to a file.
- We will save it as a jpg file.
- ggsave() is smart enough to know that we want to save it as a jpg from adding the extension .jpg to our filename.

Where did the figure get saved to?
Speaking of saving figures, check out what’s in the figs folder. How did that happen?
ggplot2 practice

- Practice your ggplot2 skills by taking the first chapter of the R-Bootcamp:
- ggplot2 plots,

haven (Optional)

The package haven allows us to import data in other software formats, including SAS, Stata, or SPSS data.

haven

Here is an example reading in a SAS dataset:
library(haven)
# mtsas <- read_sas("data/mtcars.sas7bdat")
mtsas <- read_sas(here::here("part2", "data", "mtcars.sas7bdat"))
head(mtsas)

# A tibble: 6 × 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
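- SAS transport files: the read_xpt() function
- SPSS files: read_sav()
- Stata files: read_dta()

These all read in the data as tibbles.
We can also export data into these formats using the write_* versions of these functions, such as write_xpt(), write_sav(), and write_dta() (write_sas() also exists, but it is deprecated in recent versions of haven).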
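A round trip shows the write_* and read_* functions working together. This sketch writes the built-in mtcars data to a Stata file in a temporary folder and reads it back; the temporary file location is only for illustration, and in a project you would typically build the path with here::here().

```r
library(haven)

# Write the built-in mtcars data to a Stata (.dta) file in a temporary folder
dta_path <- file.path(tempdir(), "mtcars.dta")
write_dta(mtcars, dta_path)

# Read it back in; the result is a tibble
mt_stata <- read_dta(dta_path)
head(mt_stata)
```

Note that the car names are stored as row names in mtcars, so they are not written to the file.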
- haven vignettes: https://haven.tidyverse.org/articles/semantics.html

Here, we have downloaded data from the CDC’s National Ambulatory Medical Care Survey located and described at: https://www.cdc.gov/nchs/ahcd/datasets_documentation_related.htm#data
haven

- A look at the imported data (with glimpse):

Rows: 28,332
Columns: 5
$ VMONTH <dbl+lbl> 10, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 4, 4, 4, 4…
$ VDAYR <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 5, 5, 5, 5, 3, 3…
$ AGE <dbl+lbl> 65, 45, 61, 55, 53, 92, 59, 92, 65, 1, 78, 75, 58, 58, 69…
$ AGER <dbl+lbl> 5, 4, 4, 4, 4, 6, 4, 6, 5, 1, 6, 6, 4, 4, 5, 6, 4, 3, 6, 6…
$ AGEDAYS <dbl+lbl> -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7, -7…
When we print one of the columns, we see the labels printed at the bottom:
<labelled<double>[10]>: Patient race - unimputed
[1] -9 1 1 3 -9 1 1 1 1 -9
Labels:
value label
-9 Blank
1 White Only
2 Black/African American Only
3 Asian Only
4 Native Hawaiian/Oth Pac Isl Only
5 American Indian/Alaska Native Only
6 More than one race reported
We can see more detail with str(), including the column label better defining the value “Patient race - unimputed”:
dbl+lbl [1:28332] -9, 1, 1, 3, -9, 1, 1, 1, 1, -9, 1, 1, 1, 1, ...
@ label : chr "Patient race - unimputed"
@ format.spss: chr "F2.0"
@ labels : Named num [1:7] -9 1 2 3 4 5 6
..- attr(*, "names")= chr [1:7] "Blank" "White Only" "Black/African American Only" "Asian Only" ...
- The goal of haven is not to give you a data set and labels to use for analysis,
- but an intermediate structure that you can convert into regular R data.

One such way is to convert these labeled columns to factors with the haven function as_factor():
[1] Blank White Only White Only Asian Only Blank White Only
7 Levels: Blank White Only Black/African American Only ... More than one race reported
[1] "Blank" "White Only"
[3] "Black/African American Only" "Asian Only"
[5] "Native Hawaiian/Oth Pac Isl Only" "American Indian/Alaska Native Only"
[7] "More than one race reported"
- We can then convert the factor to character with as.character() if that works better for our analysis pipeline.

Importing labeled data can be challenging. This is a good post that delves more into loading labeled data with haven:
https://www.pipinghotdata.com/posts/2020-12-23-leveraging-labelled-data-in-r/
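To experiment with as_factor() without downloading the survey data, a labelled vector can be built by hand with haven's labelled() function. The sketch below reuses the values and labels printed above for the race column; the object names are hypothetical.

```r
library(haven)

# Hand-built labelled vector using the values and labels printed above
race <- labelled(
  c(-9, 1, 1, 3, -9, 1),
  labels = c(
    "Blank" = -9,
    "White Only" = 1,
    "Black/African American Only" = 2,
    "Asian Only" = 3,
    "Native Hawaiian/Oth Pac Isl Only" = 4,
    "American Indian/Alaska Native Only" = 5,
    "More than one race reported" = 6
  ),
  label = "Patient race - unimputed"
)

# Replace the numeric codes with their labels, stored as a factor
race_factor <- as_factor(race)
race_factor
levels(race_factor)

# Convert to plain character if that suits the analysis better
race_character <- as.character(race_factor)
race_character
```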
- here::here() to load data
- data.frames
- ggplot2

Please fill out the post-class survey.
Your responses are anonymous in that I separate your names from the survey answers before compiling/reading.
You may want to review previous years’ feedback here.
here::here() here.